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Abstract 

The problem of matching unlabelled point sets using Bayesian inference is considered. 
Two recently proposed models for the likelihood are compared, based on the Procrustes size- 
and-shape and the full configuration. Bayesian inference is carried out for matching point 
sets using Markov chain Monte Carlo simulation. An improvement to the existing Procrustes 
algorithm is proposed which improves convergence rates, using occasional large jumps in the 
burn-in period. The Procrustes and configuration methods are compared in a simulation study 
and using real data, where it is of interest to estimate the strengths of matches between protein 
binding sites. The performance of both methods is generally quite similar, and a connection 
between the two models is made using a Laplace approximation. 

Keywords: Gibbs, Markov chain Monte Carlo, Metropolis-Hastings, molecule, protein, Pro- 
crustes, size, shape. 

1 Introduction 

Matching configurations of points is an important but challenging problem in many application 
areas, including in bioinformatics and computer vision. In this paper we compare two Bayesian 
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approaches that have been developed for matching unlabelled point sets. The matching problem, 
where the sets of points may be of different sizes, is relevant for the comparison of molecules and 
the comparison of objects from different views in computer vision. For example, if we have two 
protein surfaces, a question of interest is whether the two surfaces have a region of the same shape. 
This region may correspond to a binding site that the proteins have in common; for example they 
may both bind to the same protein molecule. 

In this paper we compare and build on on the Markov chain Monte Carlo (MCMC) methods re- 
cently independently developed by Green and Mardia (2006), Dryden et al. (2007) and Schmidler 
(2007), which themselves have connections with work stemming from Moss and Hancock (1996) 
and Rangarajan et al. (1997), among others. 

Green and Mardia (2006) include details of a small dataset where the problem is one of matching 
unlabelled point sets, and we use this dataset as a testbed for our comparisons. The dataset con- 
sists of the coordinates of the centres of gravity of the amino acids that make up the nicotinamide 
adenine dinucleotide phosphate (NADP) binding sites of two proteins. Protein 1 is the human 
protein 17-beta hydroxysteroid dehydrogenase. Protein 2 is the mouse protein carbonyl reduc- 
tase. The active site of protein 1 contains 40 amino acids and the active site of protein 2 contains 
63 amino acids. Green and Mardia (2006) implemented their MCMC algorithm on the protein 
data. In Table 4 of Green and Mardia (2004) (which is not in Green and Mardia, 2006) each of 
the suggested pairings between amino acids in protein 1 and in protein 2 is assigned a probability. 
These probabilities were estimated by observing how often those matches were represented in 
long runs of the MCMC algorithm after convergence, and we use these findings as a basis for 
comparing the algorithms. 

This paper consists of two main contributions. First we describe an improvement to the algorithm 
of Dryden et al. (2007) to prevent it from getting trapped in local modes in the burn-in period. 
This method involves introducing some irreversible big jumps to find a good starting point for the 
MCMC algorithm. Secondly we compare the performance of MCMC algorithms for simulating 
from two different Bayesian models: involving Procrustes matching (as in Dryden et al., 2007 
and Schmidler, 2007) and involving the full configuration (as in Green and Mardia, 2006). 



2 Procrustes model 



2.1 Match matrix 

Consider two configurations of M and iV points in m dimensions, and we write X as an M x m 
matrix and /i as a iV x m matrix of co-ordinates. In our application the configurations are 
molecules, the points are amino acid functional site centroids, and the configurations are in m = 3 
dimensions. A key part of protein molecule matching is to identify which functional sites corre- 
spond between two molecules. In chemoinformatics when comparing smaller drug molecules the 
points are atoms and it is of interest to find correspondences between pairs of atoms in molecules. 

In order to specify the labelling or correspondence between the points we use a match matrix A, 
which is a M x (N + 1) matrix of 1 s and Os, in which every row sums to 1 to represent a particular 
matching of the points in X to the points in fi. For 1 < j < N, if A^ = 1 then the i th point of X 
matches to the j th point in fx. If A^jv+i = 1 then the i th point of X does not match to any point in 
fx. Note that there is no requirement for the columns to sum to 1, and so many-to-one matches are 
allowed. Also, the matching is not symmetric - in general the match from point set A to B will 
differ from the match from point set B to A. 

We shall consider two approaches to molecule matching using different Bayesian models: a Pro- 
crustes size-and- shape model (Dryden et al., 2007; Schmidler, 2007) and a configuration model 
(Green and Mardia, 2006). The methods use Markov Chain Monte Carlo (MCMC) simulation to 
draw inferences about the match matrix and a concentration parameter, although the treatment of 
the rotation and translation nuisance parameters differs. 

2.2 Likelihood 

Given a match matrix, A, with p matching points, and configuration matrices X and /i which we 
assume have been centred, let X A beapxm matrix of the rows of X for which X^n+i = (i.e. 
the matched points in X). Let fx A be a p x m matrix of the rows of fi which correspond to the 
points in fx to which the points of X A are matched. We regard X as a random configuration and 
fx as fixed. 

A rotation of X is given by post-multiplication by a rotation matrix T e SO(m), where SO(m) 
is the special orthogonal group of m x m matrices such that T T r = IT T = I m and |T| = 1. 
A translation of X is given by addition of each row by 7 T e M. m . The size-and-shape of the 
configuration consists of all geometrical properties that are invariant under rotation and translation 



of X A , i.e. the size-and- shape of X A r + l p7 T is the same as that of X A (see Dryden and Mardia, 
1992; 1998, Chapter 8). Here l p is the p-vector of ones, and I m is the m x m identity matrix. 

We first use partial Procrustes registration to register X A to /i A , in order to define a distance 
between the size-and- shapes. This aspect of the matching is present in both the Dryden et al. 
(2007) and Schmidler (2007) approaches. The Procrustes matching involves finding f e SO(m) 
and 7 e ffi m such that 

|| ^A_X A f- l p7 T ||= inf || /-X A r- l p7 T \\=d s (X A ,fi A ), 

reso(m) 

7gRP 

where ds(X A , /i A ) is the Riemannian metric in size-and-shape space, ST^ (see Kendall, 1989; 
Dryden and Mardia, 1992, 1998). The Procrustes estimators of rotation and translation, T and 7 
are 

7 = P , f = RiRl, Ri,R 2 e SO(m), 

where (fi A ) T X A = \\X A \\ ||/i A ||i? 2 -D-Rf and D = diag(/i, h, ■ ■ ■ , l m ) is an m x m diagonal matrix 
where the eigenvalues, lj, are optimally signed > l 2 > . . . > \l m \ > 0) and non-degenerate 
(Z m _i + l m > 0), see Kent and Mardia (2001). 

Let X A = X A f + l p7 T . Then X A is the partial Procrustes fit of X A onto /i A . (It is 'partial' 
because no scaling has been used, just rotation and translation.) 

The partial Procrustes tangent coordinates of X A at fi A axe given by the p x m matrix 

V A = X A - /i A = X A f + l p7 T - // 

which is in a pm — m(m — l)/2 — m dimensional linear subspace of R mp . 

We denote the unmatched points in X by X~ A . We transform these points using the same trans- 
formation parameters as for X A . Let X~ A = X~ A T + l M „ p7 T . We consider X~ A to lie in 

I£(M-p)m_ 

Given the match matrix, A, the size-and-shape of X A lies in ST^ and X~ A lies in ]R( M -p) m . 

We assume a zero mean isotropic Gaussian model for V A in Q = pm — m(m — l)/2 — m dimen- 
sions. (There are m(m — l)/2+m linear constraints on V A due to the Procrustes registration.) We 
assume that X~ A , the non-matching part, is uniformly distributed in a bounded region, A, with 
volume |v4.| of IR m . For the protein data we use |^4| = 25500 which is the volume of a bounding 
box obtained by multiplying the maximum lengths in the x, y, z directions for each protein. 



The likelihood of X given A and r = I /a 2 , a precision parameter where a 2 is a measure of the 
variability at each point, is 

L(X\A,r,n) = f V A(V A \r,A^)f x -A(X' A \A) 

= (2 7 r)- ( 5/ 2 r«/ 2 exp(-^trace{(\/ A ) T \/ A } X 1 



\A\ m -p 



1^1 

This likelihood is given by Dryden et al. (2007) and essentially is that of Schmidler (2007) (with 
Q = mp in the latter). 



2.3 Prior and posterior distributions 

We write 7r(r) and 7r(A) for the prior distributions of r and A and assume r and A are independent 
a priori. We use the prior distribution r ~ T(a , A))- 

For the prior distribution of A, we assume the rows are independently distributed with the i th row 
having distribution 

n(\ hN+1 = 1) = V, A\j = 1) = 1 < J < N, 

for 1 < i < M and 0<V ; <l-IfV ; = ]vTT tnen ^ ^ s un if° rm ly distributed in M.m,n+i, the 
space of possible M x N + 1 match matrices. The posterior density of r and A conditional on X 
is 



EA/o°°^MA)i(A:|A,r,|i)dr- 



2.4 MCMC Inference 



The full conditional distribution of r is available from the conjugacy of the Gamma distribution, 
so we update r with a Gibbs step. 

We make updates to the match matrix using a Metropolis-Hastings step. We select a row at 
random and move the 1 to a new position in [1, . . . , A/" + 1]. In particular, if the selected point is 
already matched then it becomes unmatched with probability p re j ect , or it is matched to another 
point % with probability (1 — p re j ect ) /(N — 1). If the selected point is unmatched then it becomes 
matched to point % with probability 



We accept the new proposal, A*, with probability 

n(A*\X,n,T)q 



where 



a A = min<{ 1, 

7r(A\X,fi,T)q* 



Vr eject I (1/^0 if we are making an unmatched point matched 
q/q* = ^ (1/-/V) /p r e j ec t if we are making a matched point unmatched 

1 if we are making a matched point match a different point in protein 2. 

If p r eject = 1 /N then g/g* = 1, which was the value used by Dryden et al. (2007). 

Dryden et al.(2007) also describe a computationally faster approximate Metropolis-Hastings up- 
date to the match matrix which does not require the use of the whole configuration in the calcu- 
lation of the density. If we propose the change (i — > li) to (i — > l 2 ) then the alternative Hastings 
ratio, a* A is given by 

a* x = mm{g(xi, n h )q/(g(xi, n h )q*),l}, (1) 

where 

i-y> / T \ m / 2 



g{xi,Hj) = 



N 



(^-)^exp(-||^-^| 2 ), if j < AT + 1 
*PjX\ ifj=N + l. 



When a new match is accepted the ordinary partial Procrustes registration is carried out on the 
new matching points to ensure the configuration of matching points has rotation removed. 

For brevity we shall refer to the size-and- shape model as the "Procrustes model", and matching 
using MCMC simulation with this model as the "Procrustes method". Note that Schmidler (2007) 
uses geometric hashing for computationally fast approximate inference, which we do not consider 
here. 



2.5 Improving the Procrustes algorithm 

One of the problems with the MCMC scheme is that because of the multimodality of the likeli- 
hood function for the match matrix A, the molecules often get stuck in a local mode. In order to 
circumvent this problem Dryden et al. (2007) ran the algorithm from a number of different start 
points until the algorithm had reached a position which satisfied certain convergence criteria. 

We propose a new initialisation algorithm which involves proposing much more radical changes 
to the match matrix than changing just one row. The four types of bigger moves are called 
'nearness', 'rotation', 'translation' and 'flip'. All four types of proposal are non reversible, and 
therefore we only allow these big jumps at the start of the MCMC algorithm. Effectively the use 



of these proposals helps to find a good starting point for the subsequent MCMC inference. The 
new moves are: 



1. Nearness. Each of the matched points in X (i.e. those rows of A that have a in the last 
column) is matched to the point in p that is nearest to it. Let I A be the index of matched 
points, so I A = {i e {1, 2, . . . , M} : A^+i = 0}. We define A* = (A* ) by 

{1 i£l A ,j = N + l 
1 iel x , || (X)i - (n)j ||= mim 6{li ... )JV} || (X)i - (n)i \\ 
otherwise. 

Let N(X, p, A) = A* as defined above. Note that A* has the same number of matched 
points as A. The other three methods (rotation, translation and flip) use this nearness step 
at the end. 

2. Rotation. Randomly choose an angle 9 ~ U[—n, n]. Randomly choose an axis (x, y or z) 
about which to rotate, and set R = R x (9), R y (0) or R z (9) as appropriate, where R x , R y , R z 
are defined in © and ©. Let X* = XR then map each point in X* to the nearest point in 
p, i.e. A* = R A (X, fi, A) = N(X*,p, A). 

3. Translation. Choose 7 ~ AT 3 (0, a 2 ). Define X* = X + 1m7 T and then map each point in 
X* to its nearest point in fi. Thus A* = T A (X, li, A) = A^(X*, //, A). 

4. Flip. This move has the same form as the rotation step, but instead of selecting 9 from a 
U[—n, 7r] distribution we set 9 = tc. 

We define an initialisation phase by setting a maximum number of initial jumps, N iniUa ii sation . 
We also define a settling time, N setUe . During the initialisation phase (i.e. < N initiaHsation in- 
teractions) at least N sett i e default updates are proposed between any two big jump proposals. 
The rationale behind this is to explore the region of the parameter space we 'land in' after 
making a big jump before immediately jumping somewhere else. The hope is that the settling 
time allows the algorithm to home in on a solution if a big jump takes us somewhere close 
to the optimal solution. Provided at least N sett ie default updates have been proposed we ran- 
domly choose an update type from {nearness, rotation, translation, flip, default}, with probabili- 
ties p n , p r , p t ,Pf, 1 — (p n + Pr + Pt +Pf), say. Whichever update method is chosen, a new match 
matrix, A* is generated. We then accept the new match matrix with probability 

ctA = min{l, 7r(A*|X, /j, r)/ir(A\X, p, r)}. 
After N initia i isation iterations the algorithm proceeds exactly as described in Dryden et al. (2007). 



3 Configuration model 



3.1 Likelihood 



We now consider an alternative model for the configuration of points which turns out to be equiv- 
alent to that of Green and Mardia (2006). We again assume that /i is a fixed N x m configuration 
and X is an M x m configuration that we apply rigid-body transformations to. 

This model for the co-ordinates of the points does not involve removing rotation and translation 
by Procrustes matching. Rather, the rotation matrix T G SO(m) and the translation parameter 7 
will be parameters in the model. The matched points in X A are taken as Gaussian perturbations 
of the matching points in //, and we assume that the rows of X~ A are distributed uniformly over 
a bounded region A C W 71 of volume |„4|. We concentrate on the m = 3 dimensional case here. 

Given an M x (iV + 1) match matrix, A (with p matching points), rotation matrix T and translation 
vector 7 the likelihood is therefore defined as: 

\ 3p/2 



—J r 3 ^ 2 exp (~trace{(X A - /i A f (X A - ^ A )}) 



1 



x 



\A\ m -p' 

where X A = X A T + 1 P 7 T , T = R z {9 12 )Ry{9 lz )R x {e 2 z), and the rotation matrices about the 
x, y, z axes are: 
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- sin #23 cos #23 
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cos #12 sin 9 12 

— sin #12 cos #12 




COS #13 



- sin #13 

o\ 





sin #13 

1 

COS #13 



(2) 



(3) 



with Euler angles #12 G [—n, it), # 13 G [— 7r/2, it/2], #23 G [— n, n). There are many choices of 
Euler angle representations and all have singularities (Stuelpnagel, 1964), although the singular- 
ities have measure zero with respect to Haar measure which is given by 

1 



8tt 2 



cos(9i 3 )d9 12 d9 13 d9 2 3 



in this case (e.g. see Khatri and Mardia, 1977). 

Note that Green and Mardia (2006)'s model is constructed with X and /i as Gaussian perturba- 
tions from an underlying Poisson process. However, the likelihood is actually of the same form as 
the one sided version, where X is perturbed from /i, although the variance parameter is doubled. 



3.2 Prior and posterior distributions 



We take r, A, T, 7 to be mutually independent a priori, and the priors of r and A are taken as in 
Section [2731 We also take the prior for 7 as: 

7 ~ N 3 (n 7 ,a*I), 

and we take T to be uniform with respect to Haar measure on SO(m). The posterior density of 
(A, r, T, 7) conditioned on X is 

, Ar lx s 7r(r)7r(A)7r(r)7r(7)L(X|A,r,/x,r,7) 

1 ' ' ' 7 ' ,fl) Ea Jo°° 7r(^)7r(A)7r(r)7r(7)L(X|A, r, //, T, 7)dr' 

3.3 MCMC simulation 

The full conditional distribution of r is given by 

3v \\X A — u An ' 2 

(r|X,A,r,7, yU )~r|a + ^,/3 + lh 



2 " 2 



and so a Gibbs update can be used for r. 



We update the rotation angles using a Metropolis-Hastings step, drawing the proposal perturba- 
tions from a uniform distribution on [—0.2, 0.2] for 12 , #23, and uniform on [—0.1, 0.1] for 13 , to 
give proposed angles Q{ 2 , #13, #23- The Hastings ratio is: 



mm 



1 vr(r, A, rfe g|a, 2 * 3 ), 7 |X, /X) COS g|g 
' 7T(r, A, r(6»i2, 6*13,6*23), 7l^, At) COS 6»i3 



and the extra cosine terms are due to the Haar measure on the special orthogonal rotation group. 
The full conditional distribution of 7 is given by 

7 |X, h,t, A, T ~ AM 11/2 ' — 11/ 2 7 ' (4) 

y + 1/cr^ pr + \ jai f J 

and so we use a Gibbs update for 7. 

We update the match matrix A in the same way as in the Procrustes model using the acceptance 
probability 

. / 7r(A*|X,/i,r,r,7)g 

a a = mm 1, — — : r^— 

V '7r(A|X,/i,T,r,7)g* 

. , L(X|A%/i,r,r, 7 )7r(A*)g 
mU " ' L(X|A, yU ,r,r,7)7r(A)g* 



Suppose A* contains the match (z — > li) and A contains the match {i — > l 2 ), where li ^ l 2 and 
the match matrices A* and A are otherwise identical. The acceptance probability a\ is exactly 
the same as that given in Equation ([I]), i.e. the fast method of Dryden et al. (2007). Hence the 
MCMC updates of A for the Procrustes and Configuration models are more similar than they first 
appear. 

Note that our implementation of the MCMC simulation differs slightly from Green and Mardia 
(2006) who use a matrix Fisher conjugate prior for the rotation, and update two of the rotation 
angles with a Gibbs step. In addition, Green and Mardia (2006) ensure that the matching is 1-1 
between the points, whereas we do allow the possibility of many-to-one matches. 

For brevity we shall refer to this model as the "Configuration model", and matching using this 
model as the "Configuration method". The Configuration model has been demonstrated to work 
well in a variety of situations (see Mardia et al., 2007). 

3.4 Laplace approximation 

Let us consider the posterior density 7r(A, t, T, 7|X). Note that the rotation and translation T, 7 
are nuisance parameters, and one has a choice about how to deal with them. In the Configuration 
approach one samples from the full joint distribution of (A, r, T, j\X) and so joint inference of 
all the parameters can be carried out. However, if T, 7 are considered nuisance parameters then 
we can integrate them out to give the marginal density of (A, r) 



In the Procrustes approach the match is obtained by optimizing over the nuisance parameters, and 
so we consider the different posterior density based on 



We can consider © to be an approximation to the marginal density © where the integral is 
approximated using Laplace's method (Tierney and Kadane, 1986). 

From a Bayesian analysis perspective it is natural to work with the marginal posterior distribution 
©. From a shape theory perspective the analysis should be invariant under rotations or transla- 
tions of the data, and so a uniform prior for T, 7 in © or a distribution of the form © are both 
natural. In this paper we will explore the relative performances of the two approaches in some 
practical scenarios. 




(5) 



7T P (A, t\X) oc sup 7r(A, t, T, j\X). 



(6) 



4 Applications and simulations 



4.1 Assessment of initialisation procedure 

Here, we use the NADP-binding site protein data to assess the efficiency of the Procrustes algo- 
rithm, both with and without the large jump proposals. There are 40 centres of gravity of amino 
acids for protein 1 and 63 for protein 2. Following Green and Mardia (2006) we take the prior 
hyperparameters to be a = 1, /3 = 36, /i 7 = 0, <r 7 = 50, and we take ip = 0.2. The proposal 
parameters p n = 0.001, p r = 0.02, pj = 0.01, pt = 0.09, N sett i e = 850 for this application. 

We used the a priori 'correct' matches, as identified in Green and Mardia (2004) to define a 
convergence criterion. To assess the efficacy of this criterion for determining convergence, we 
started 50 MCMC runs from distinct initial configurations in each of which 10 correct matches 
were selected at random. Each run was allowed to run for 50000 iterations, and we measured 
the number of correct matches after each 1000 iterations. The results are shown in Figure [Q In 
all 50 cases, for both the Procrustes and the Configuration models, the algorithms converged to 
around 36 correct matches. It is interesting to note that the Procrustes model converges quicker 
and more reliably, although with the large initialization proposals this is not surprising (see the 
variance plots in Figure CD- 

INSERT FIGURE 1 ABOUT HERE 

To compare the convergence performance of the Procrustes and Configuration methods, we ini- 
tiated 25 runs from random starting points. We allowed each run to continue for a maximum 
of a million iterations, monitoring the number of correct matches after every thousand itera- 
tions. On the basis of the results described above, we stipulated that if within these million 
iterations the number of correctly matches reached 10 then that counted as convergence. Such 
runs were allowed to continue for a further 50000 iterations. The Procrustes method was used 
both with and without the big jumps described above; these were only used during the initial 
N initialisation = 1000000 iterations. Figure [2] shows histograms of the number of iterations before 
the algorithms converged to 10 correct matches for the successful runs. The success rates of 10/25 
for the Procrustes method without big jumps and 6/25 for the Configuration method were not too 
encouraging. However, when big jumps were included for the Procrustes method, the success 
rate increased to 22/25, a very impressive result. 

INSERT FIGURE 2 ABOUT HERE 

In Green and Mardia (2006), they report convergence within a million iterations on 83 out of 100 



tests run from random starting points. They define convergence differently to us, looking for runs 
in which the log-posterior goes higher than some threshold. It is important to note three things 
when looking at this result and comparing it with the results of Figure El Firstly, in the Green and 
Mardia paper, they update the match matrix 10 times per sweep, so they are effectively looking 
at the convergence within 10 million iterations. Secondly, their proposal methods for the angles 
in particular are different; they use Gibbs steps instead of Metropolis-Hastings updates, making 
use of conjugacy of the matrix Fisher distribution. This may also improve their convergence 
performance, with the form of the proposals being closer to the true distribution. Finally, the way 
the model is formulated is different, with 1-1 matches and a hidden Poisson process being used. 

Although the algorithm was much more likely to converge within a million iterations if the big 
jumps were included, it did mean that from certain starting points the algorithm took a lot longer 
to converge if the big jumps were included than if they were not. This is a consequence of the 
choice of the settling time parameter between large jump proposals. One way to avoid this might 
be to let the algorithms run for an initial period of 100000, say, before introducing any big jumps. 
This way, if the algorithm converged within that period then it would not be necessary to use the 
big jumps at all. Also, the settling time between large jump proposals could be increased. Despite 
the fact that it often took longer for the algorithm to converge with the large jumps, the evidence 
is compelling that the big jumps vastly improve convergence. 

We experimented with the probabilities of acceptance for the four types of large jumps. At the 
levels we settled on (given in the caption of Figure [2]) the nearness proposal was always accepted 
(which is always the case since the likelihood always increases for the nearness proposal), and the 
other three types were accepted roughly a quarter (flip), a third (rotation) and half (translation) of 
the times when they were proposed. 

4.2 Long run comparisons 

In order to compare further the Procrustes and Configuration algorithms we apply the MCMC 
scheme from a number of long runs of the method. In order to ensure that we started the algo- 
rithms close to convergence, we initialised the proteins by aligning the first 10 pairs of amino 
acids as given in Table 4 of Green and Mardia (2004). 

We ran the two algorithms and looked at the proportion of the accepted match matrices after 
convergence in which particular matches were represented. Although in principle many to one 
matches were possible, they did not tend to occur in the long runs after convergence. We ran the 
experiment for five values of ip, the prior probability of a particular point being unmatched, and 



five values of the proposal probability p Te ject, the probability of moving a matched point to an 
unmatched status in the proposal for the change to the match matrix. For each parameter the five 
values we used were 0.001, 1/63, 0.1, 0.2 and 0.4. (The 1/63 is there because N = 63 and in the 
case of ip, this corresponds to a uniform prior for A.) 

Altering p Te j ect had little effect on the results. We fix p re j ect = 0.2 and consider the effects of 
varying ip, the prior probability of each point in protein 1 being unmatched (independently of the 
other points). We ran each MCMC algorithm for 1000000 iterations after convergence, adding the 
match matrices together. In Figure[3l we show how often the 36 most likely matches from Table 4 
of Green and Mardia (2004) appear in our match matrices after convergence. These percentages 
are calculated as the number of times each match occurred divided by the total number of match 
matrices. 

INSERT FIGURE 3 ABOUT HERE 

We have calculated a 'threshold match matrix' by putting a 1 in each position that corresponds 
to the maximum entry in a row of the summed match matrices and a everywhere else. This 
gives us a method for comparing how many points are matched for each value of ip. For values of 
ip G {0.001, 1/63, 0.1, 0.2, 0.4} the number of unmatched points are {0, 0, 1, 4, 4} respectively, 
for both the Procrustes and Configuration methods. Clearly changing the prior distribution of A 
by altering ip has an effect on the number of points that are matched. 

Figure [3] shows that using the Configuration model, we obtain probabilities for the top 36 matches 
reported in Green and Mardia (2006) that are similar to the figures quoted in that paper. How- 
ever, using the Procrustes model, the probabilities are all significantly closer to 1 than using the 
Configuration model. This suggests that the Procrustes model is 'stickier' than the Configuration 
model, in the sense that matches are released less readily after convergence. The simulation study 
below investigates the relationship of long run convergence probabilities with different variances, 
and the results suggest that there is a possibility that the results observed in Figure [3] may be a 
contingent property of the variability of the points. We return to this in the discussion of the 
simulation study. 

Note that the posterior standard deviation a = 1/ y/r was smaller for the Procrustes model. In 
particular, the means of the 10000 values well after burn-in were 0.869 for the Procrustes model 
and 1.355 for the Configuration model. 



4.3 A simulation study 

We consider now a simulation study where we know what the true probabilities of matching are 
and compare the MCMC algorithms both with and without Procrustes registration to see how they 
perform. The details of this simulation are as follows: 

Step 1 Define a length, L > and a minimum distance < d m i n < L. Fix M, N £ N, 
n ones < M. As before, M is the number of points in the point set X and iV is the number 
of points in the point set fj,. Define a vector of probabilities, p = (pi,P2, ■ ■ ■ ,Pm), where 
Pl = p 2 = ... = Pnones = 1 and ^ = for % = n ones + 1, . . . , M. Fix s < d min ; this is 
the standard deviation of the pertubations of the random points. 

Step 2 Sample the N points of \i from a uniform distribution on the cube with corners 

{(-L, -L, -L), (-L, -L,L),..., (L, L, L)} 

subject to the constraint that each new point is at least a distance d min from every other 
point. For the i th point in X, denoted (X)i, if pi — 1 then we sample from a Normal 
distribution centred on the i th point in ji, 

(X) i ~iV 3 ((^, S 2 /3), 

else we sample uniformly from the cube with corners as above, 

(X)i ~ [/[cube as above]. 

Step 3 Run the two MCMC algorithms for N iter iterations starting from the match matrix which 
matches (X)j to (p)j for j = 1,2,..., n ones . (In other words we start the algorithms from 
convergence.) For i = 1, . . . , , n ones , record the proportion of the iV iter match matrices 
that match (X)j to {p)i. For i = n ones + 1, . . . , M, record the proportion of the N iter 
match matrices for which (X), is unmatched. 

Step 4 Hold fx constant and sample a new X as described in step 2. Repeat step 3. Continue this 
process until the proportions of successful matches and successfully unmatched points 
have been recorded for K runs of the MCMC algorithm. 

Step 5 Repeat experiment for various values of s < d min . 

Figure H] shows the results of running this experiment with M = 20, iV = 24 and n ones = 12. 
The values chosen for L and d min were 10 and 2 respectively. The experiment was run for four 
values of s, the standard deviation parameter. These were <i min /20, d min /10, d min /5, d min /2, or 
0.1, 0.2, 0.4 and 1. The value of N iter , the number of iterations after convergence, was 100000. 



INSERT FIGURE 4 ABOUT HERE 

Figure @]has a curious feature. When the value of the standard deviation is less than or equal to 
dmin/5, the Configuration model seems to estimate the probabilities for both matched and un- 
matched points more reliably than the Procrustes model. For both models the matched points are 
rarely released when the matching is very precise, but the Configuration model gives probabilities 
closer to 1 than the Procrustes model. (This is not clear from just looking at the graphs). When 
the standard deviation is increased to d min /2, the Configuration model still performs better than 
the Procrustes model on the unmatched points. Interestingly, now the Procrustes model gives 
significantly better (i.e. higher) estimates for the probabilities for the matched points. 

With reference to the results illustrated in Figure [31 this simulation study poses an interesting 
question. In Figure [31 we found that the Procrustes method appeared 'stickier' than the Configu- 
ration method. In the light of the findings of this simulation study, it is possible that this result is 
a feature of the particular relationship between the variance parameter and the minimum distance 
between points in this particular dataset. From the simulation, it appears there may be a critical 
value of the standard deviation parameter, somewhere between d min /5 and d min /2, for which the 
two MCMC methods swap over in terms of which one gives the higher probabilities for particular 
matches. 

5 Discussion 

In conclusion, it is clear that the Procrustes method is significantly improved by considering the 
initial large jumps. However, despite quite extensive comparisons there is not an overall prefer- 
ence between the Procrustes or Configuration methods for all situations. The Procrustes method 
appears to converge more reliably to the true solution when the proteins are initialised by select- 
ing 10 correct matches at random. This is a consequence of the optimisation over the rotation and 
translation parameters that takes place in the Procrustes method. However, for simulated datasets 
where the variance is small, the Configuration method more reliably predicts the probabilities 
of matches, and the Procrustes method was more likely to suffer from false matches. For larger 
variances the Procrustes method was more effective at estimating correct matches, without more 
false matches. In essence both models are fairly similar, and inference using marginal posteriors 
© or © is similar in practice due to the Laplace approximation. 

Although we have just considered pairwise matching of two configurations here, the methods 
extend to matching multiple molecules. Extensions of the Procrustes and Configuration models 



for multiple alignments have been given by Dryden et al. (2007) and Ruffieux and Green (2008) 
respectively. 

The way we have set up the MCMC procedures, we do not exclude the possibility of many-to-one 
matches. We have followed the methodology of Dryden et al. (2007) and found that in general 
many-to-one matches are not selected in long runs after convergence. However, it would be easy 
to constrain the choice of match matrices such that only one-to-one matches were proposed. This 
is the method adopted by Green and Mardia (2006). 

MCMC tools are an effective way of finding the optimal correspondence and registration between 
two point sets where we wish to match a subset of points from one set to a subset of points from 
the other set. But because of the combinatoric nature of looking for possible correspondences, 
the algorithms are currently prohibitively time consuming for large data sets. Suppose we were 
interested in comparing two large protein surfaces to look for regions of a similar shape (such 
as binding sites that are common to both proteins). It may be possible to use an efficient search 
algorithm to scan the surface of the two proteins for small regions that are potential candidates 
for binding sites and then apply the MCMC methods to those small sites individually to confirm 
whether or not there are subsets of the two regions that match well. Schmidler (2007) notes 
the difficulties of using MCMC methods for large problems and suggests the use of geometric 
hashing to compute approximate posterior quantities efficiently. 
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Figure 1 : A comparison of the numbers of correct matches over 50000 iterations (as defined in 
Green and Mardia (2004) for the Procrustes and Configuration models, initialising by choosing 
10 correct matches at random 



Size-and-shape model with big jumps 



Size-and-shape model without big jumps 



successes=22/25 



ri mm nn 



— i — i — i — i — i 

200 400 600 800 1000 
iterations 
(000s) 



successes=10/25 



ML 



— i 1 1 1 

200 400 600 800 
iterations 
(000s) 



Configuration model with big jumps 

successes=6/25 



o u 

I I I 
20 40 60 80 
iterations 
(000s) 

Figure 2: Histograms of the number of iterations to convergence in successful runs from random 
starting points for the Procrustes model, with and without large jumps, and the Configuration 
model. The parameters for the large jumps are: a T = 2.2, p n = 0.001, p r = 0.02, pf = 0.01, p t = 

0.09, N se ttle = 850. 
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Figure 3: The proportions of match matrices containing particular pairings, based on 1000000 
iterations after convergence for the Procrustes and Configuration models for five values of ip - a 
comparison with the percentages quoted in Green and Mardia (2006). 
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sd=d.min/20, configuration sd=d.min/10, configuration sd=d.min/5, configuration sd=d.min/2, configuration 
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Figure 4: The means (circles and squares, left hand scale) and variances (stars, right hand scale) 
of the proportions of successful matches (black circles and black small stars) and successfully 
unmatched points (red squares and red large stars) with and without Procrustes registration on 
long runs (100000 iterations) after convergence. Here, there are 20 points in configuration X and 
24 points in configuration ji. The points in // are sampled uniformly from a cube of side length 20 
subject to the constraint that they are a minimum distance d m % n = 2 from the nearest neighbour. 
The first 12 points in X are sampled from Normal distributions centred at the corresponding 
points in [i and the last 8 points in X are sampled uniformly on the cube of radius 2L. The means 
and variances are calculated over 100 runs, with /i held constant and X resampled each time. 



